Research Synthesis Methods — Latest Matching Preprints

1

Accuracy and efficiency of using artificial intelligence for data extraction in systematic reviews. A noninferiority study within reviews

Lee, D. C. W.; O'Brien, K. M.; Presseau, J.; Yoong, S.; Lecathelinais, C.; Wolfenden, L.; Thomas, J.; Arno, A.; Hutton, B.; Hodder, R. K.

2026-02-27 public and global health 10.64898/2026.02.25.26347053 medRxiv

Top 0.1%

40.0%

Show abstract

BackgroundSystematic reviews are important for informing public health policies and program selection; however, they are time- and resource-intensive. Artificial intelligence (AI) offers a solution to reduce these labour-intensive requirements for various aspects of systematic review production, including data extraction. To date, there is limited robust evidence evaluating the accuracy and efficiency of AI for data extraction. This study within a review (SWAR) aimed to determine whether human data extraction assisted by an AI research assistant (Elicit(R)) is noninferior to human-only data extraction in terms of accuracy (i.e. agreement) and time-to-completion. Secondary aims included comparing error types and costs. MethodsA two-arm noninferiority SWAR was conducted to compare AI-assisted and human-only data extraction from 50 RCTs chronic disease interventions. Participants were randomised to extract all data required for conducting a review, using either the AI-assisted or human-only method. Accuracy was assessed using a three-point rubric by an independent assessor blinded to group allocation, based on agreement between extracted data and the assessor. Accuracy scores were standardized to a 0-100 scale. Analysis included overall and subgroup accuracy (data group and data type) using paired t-tests. Time-to-completion was self-reported by data extractors. Type of errors were coded by type and severity, and costs were calculated for data extraction, preparation of files, training and the Elicit(R) Pro subscription. ResultsThere was no difference in overall accuracy between the AI-assisted and human-only arms (mean difference (MD) 0.57 (on a 0-100 scale), 95% confidence interval (CI) -1.29, 2.43). Subgroup analysis by data group found AI-assisted to be more accurate than human-only data extraction for data variables describing intervention and control group (MD 4.75, 95% CI 2.13, 7.38), but otherwise no subgroup differences were observed. AI-assisted data extraction was significantly faster (MD 24.82 mins, 95% CI 18.80, 30.84). The AI-assisted arm made similar error types (missed or omitted data: AI-assisted 3.6%, human-only 3.4%) and severity (minor errors: AI-assisted 6.7%, human-only 6.5%) and cost $181.98 less than the human-only data extraction across the 50 studies. ConclusionAI-assisted data extraction using Elicit(R) showed noninferior accuracy, faster completion times, similar error types and severity, and lower costs compared to human-only extraction. These efficiency gains, without loss in accuracy suggest AI-assisted data extraction can replace one human-only data extractor in future systematic reviews of RCTs. Future research should explore different models of AI data extraction such as two AI-assisted extractors or AI-only extractor with human-only extractor, and comparison of AI-assisted to AI-only.

2

Collaborative large language models (LLMs) are all you need for screening in systematic reviews

Parmar, M.; Naqvi, S. A. A.; Warraich, K.; Saeidi, A.; Rawal, S.; Faisal, K. S.; Kazmi, S. Z.; Fatima, M.; He, H.; Safdar, M.; Liu, W.; Haddad, T.; Wang, Z.; Murad, M. H.; Baral, C.; Riaz, I. B.

2026-02-17 health informatics 10.64898/2026.02.07.26345640 medRxiv

Top 0.1%

23.6%

Show abstract

BackgroundThe ability of large language models (LLMs) to work collaboratively and screen studies in a systematic review (SR) is under-explored. Hence, we aimed to evaluate the effectiveness of LLMs in automating the process of screening in systematic reviews. MethodsThis is an observational study which included labeled data (title and abstracts) for five SRs. Originally, two reviewers screened the citations independently for eligibility. A third reviewer cross-checked each citation for quality assurance. GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 were used using zero-shot chain-of-thought prompting. Collaborative approaches included (i): conflict resolution using benefit of the doubt, (ii) majority voting using an independent third LLM and (iii) conflict resolution using an informed third LLM. Performance was assessed using accuracy, precision for exclusion, and recall for inclusion. Work saved over samples (WSS) was computed to estimate the reduction in manual human effort. ResultsA total of 11300 articles were included in this study. The individual models, GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 exhibited a high precision for exclusion, achieving 99.7%, 99.7%, and 99.2% and high recall for inclusion achieving 95.5%, 96.6% and 85.7%, respectively. However, the collaborative approach utilizing the two best-performing models (GPT-4 and Claude-3S) achieved an average precision of 99.9% and a recall of 98.5% (across all collaborative approaches). Furthermore, the proposed collaborative approach resulted in an average WSS of 63.5%, compared to the average WSS of 45.2% for individual models. Conversational LLM interactions showed a consistent pattern of results. LimitationsThis study was limited due to reliance on proprietary models, and evaluation on oncology datasets. ConclusionEvidence shows that collaborative LLMs enable efficient, high-performing screening in systematic reviews, supporting continuous evidence updates. Primary funding sourceNIH (U24CA265879-01-1) and Carolyn-Ann-Kennedy-Bacon Fund.

3

JARVIS, should this study be selected for full-text screening? Performance of a Joint AI-ReViewer Interactive Screening tool for systematic reviews

Barreto, G. H. C.; Burke, C.; Davies, P.; Halicka, M.; Paterson, C.; Swinton, P.; Saunders, B.; Higgins, J. P. T.

2026-04-11 health informatics 10.64898/2026.04.08.26350384 medRxiv

Top 0.1%

19.2%

Show abstract

BackgroundSystematic reviews are essential for evidence-based decision making in health sciences but require substantial time and resource for manual processes, particularly title and abstract screening. Recent advances in machine learning and large language models (LLMs) have demonstrated promise in accelerating screening with high recall but are often limited by modest gains in efficiency, mostly due to the absence of a generalisable stopping criterion. Here, we introduce and report preliminary findings on the performance of a novel semi-automated active learning system, JARVIS, that integrates LLM-based reasoning using the PICOS framework, neural networks-based classification, and human decision-making to facilitate abstract screening. MethodsDatasets containing author-made inclusion and exclusion decisions from six published systematic reviews were used to pilot the semi-automated screening system. Model performance was evaluated across recall, specificity and area under the curve precision-recall (AUC-PR), using full-text inclusion as the ground truth. Estimated workload and financial savings were calculated by comparing total screening time and reviewer costs across manual and semi-automated scenarios. ResultsAcross the six review datasets, recall ranged between 98.2% and 100%, and specificity ranged between 97.9% and 99.2% at the defined stopping point. Across iterations, AUC-PR values ranged between 83.8% and 100%. Compared with human-only screening, JARVIS delivered workload savings between 71.0% and 93.6%. When a single reviewer read the excluded records, workload savings ranged between 35.6 % and 46.8%. ConclusionThe proposed semi-automated system substantially reduced reviewer workload while maintaining high recall, improving on previously reported approaches. Further validation in larger and more varied reviews, as well as prospective testing, is warranted.

4

Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods: Protocol for an adaptive platform study within reviews

Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.

2026-04-15 health informatics 10.64898/2026.04.13.26350802 medRxiv

Top 0.1%

18.3%

Show abstract

Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.

5

Transportability of missing data models across study sites for research synthesis

Thiesmeier, R.; Madley-Dowd, P.; Ahlqvist, V.; Orsini, N.

2026-03-10 epidemiology 10.64898/2026.03.09.26347913 medRxiv

Top 0.1%

14.9%

Show abstract

IntroductionSystematically missing covariates are a common challenge in medical research synthesis of quantitative data, particularly when individual participant data cannot be shared across study sites. Imputing covariate values in studies where they are systematically unobserved using information from sites where the covariate is observed implicitly assumes similarity of associations across studies. The behaviour of this assumption, and the bias arising from violating it, remains difficult to qualitatively reason about. Here, we evaluated a two-stage imputation approach for handling systematically missing covariates using simulations across a range of statistical and causal heterogeneity scenarios. MethodsWe conducted a simulation study with varying degrees of between-study heterogeneity and systematic differences in model parameters. A binary confounder was set to systematically missing in half of the studies. Study-specific effect estimates were combined using a two-stage meta-analytic model. The performance of the imputation approach was evaluated with the primary estimand being the pooled conditional confounding-adjusted exposure effect across all studies. ResultsBias in the pooled adjusted effect estimate was small across scenarios with low to substantial between-study heterogeneity. Bias increased monotonically with increasingly pronounced differences in causal structures across study sites. Coverage remained close to the nominal level under low to substantial between-study heterogeneity, but deteriorated markedly as differences in causal structures between study sites became more severe. ConclusionThe two-stage cross-site imputation approach produced valid pooled effect estimates across a wide range of simulated scenarios but showed monotonic sensitivity to differences in causal structures across studies. The results provide insight into the conditions under which cross-site imputation may be appropriate for handling systematically missing covariates in research synthesis.

6

Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets

Halpern, M.

2026-03-23 bioinformatics 10.64898/2026.02.17.706322 medRxiv

Top 0.1%

14.6%

Show abstract

BackgroundData extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. MethodsA single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. ResultsAcross five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). ConclusionsA single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling. HighlightsO_ST_ABSWhat is already knownC_ST_ABSO_LIData extraction is the primary bottleneck in meta-analysis, with single-extractor error rates of 17.7% C_LIO_LIExisting LLM-based extraction systems achieve only 26-36% accuracy on continuous outcomes C_LIO_LINo study has validated AI extraction against multiple independent datasets using formal equivalence testing C_LI What is newO_LIA single AI agent achieves statistical equivalence with human-extracted data across five agricultural meta-analyses (1,149 observations, 136 papers) C_LIO_LILLM-driven alignment resolves the previously underappreciated bottleneck of moderator matching, improving correlations from 0.377-0.812 to 0.984-0.997 without changing extracted values C_LIO_LITable-sourced observations achieve 5.5x lower error than figure-sourced data C_LI Potential impact for RSM readersO_LIProvides a validated, reproducible workflow for AI-assisted data extraction in meta-analysis C_LIO_LIDemonstrates that most apparent "extraction error" in validation studies is actually alignment error C_LIO_LIOffers practical quality signals (source-type labeling) for downstream meta-analysts C_LI

7

TrialScout links published results to trial registrations using a large language model

Ahnström, L.; Bruckner, T.; Aspromonti, D. A.; Caquelin, L.; Cummins, J.; DeVito, N. J.; Axfors, C.; Ioannidis, J. P. A.; Nilsonne, G.

2026-03-17 epidemiology 10.64898/2026.03.15.26348383 medRxiv

Top 0.1%

14.2%

Show abstract

BackgroundMultiple stakeholders need to locate results of registered clinical trials but frequently struggle to find them. Summary results of clinical trials are often not published in trial registries, and publications containing trial results are often not explicitly linked to their respective trial registrations. Finding these results is important to researchers, systematic reviewers, research funders, regulators, clinical practitioners, and patients. MethodsWe developed TrialScout, a computer program that uses a large language model to match clinical trials registered on ClinicalTrials.gov with corresponding result publications indexed in PubMed. TrialScouts performance was evaluated through comparison to human-coded matches from previous studies of results reporting rates. Subsequently, TrialScout was applied to a random sample of 9,600 completed or terminated trials. ResultsTrialScout had a sensitivity of 92.5% and a specificity of 81.2% compared to human coders. Manual review of 200 cases where TrialScout disagreed with human researchers showed that a majority (123/200, 61.5%, 95% CI, 54.4-68.3%) of disagreements were due to human errors. When used on 9,600 sampled trials in ClinicalTrials.gov, TrialScout found result publications for 6,110 (63.6%) of trials. DiscussionTrialScout reliably located results of completed clinical trials. The tool offers benefits in terms of speed and efficiency. Estimating TrialScouts accuracy is limited by the lack of a true gold standard. TrialScout can accelerate the process of locating trial results in the scientific literature and can assist in monitoring trial reporting practices.

8

Does the sensitivity- and precision-maximizing RCT filter find all 'included' records retrieved by the sensitivity-maximizing filter on Ovid MEDLINE? An investigation using 14 Cochrane reviews

Fulbright, H. A.; Marshall, D.; Evans, C.; Corbett, M.

2026-03-23 health informatics 10.64898/2026.03.20.26348876 medRxiv

Top 0.1%

12.3%

Show abstract

ObjectivesTo inform users about the impact of two updated study filters for limiting database search results to randomized controlled trials on Ovid MEDLINE: a sensitivity-maximizing version (SM) and a sensitivity-and-precision-maximizing version (SaPM). To provide an updated understanding of how they compare to each other. MethodsUsing the final included records of 14 Cochrane reviews that had used the SM filter, we determined how many available records on Ovid MEDLINE would have been retrieved with each filter; investigated why records were missed; the unique yield; precision; and number-needed-to-read (NNR) for each filter. We also performed forwards and backwards citation searching on missed records (to determine if this could mitigate the risk of missing includes) and calculated the percentage change in the overall number-needed-to-screen (ONNS) when applying each filter to reproduction strategies. ResultsOn average, the SaPM filter reduced ONNS by 83% and retrieved 95.9% of includes compared with 98.2% retrieved by the SM filter. The SaPM filter offered a further 28.2% mean reduction in ONNS over the SM filter. The SM filter had a unique yield of 12 and a precision of 1.5%, versus a unique yield of three and precision of 4.4% for the SaPM filter. NNR was 68 for the SaPM filter versus 189 for the SM filter. ConclusionThe SaPM filter reduced the screening burden with minimal risk of missing eligible records (which could be mitigated by citation searching). Decisions about which filter to use should consider both the needs and resources of the review.

9

State of play in individual participant data meta-analyses of randomised trials: Systematic review and consensus-based recommendations

Seidler, A. L.; Aagerup, J.; Nicholson, L.; Hunter, K.; Bajpai, R.; Hamilton, D.; Love, T.; Marlin, N.; Nguyen, D.; Riley, R.; Rydzewska, L.; Simmonds, M.; Stewart, L.; Tam, W.; Tierney, J.; Wang, R.; Amstutz, A.; Briel, M.; Burdett, S.; Ensor, J.; Hattle, M.; Libesman, S.; Liu, Y.; Schandelmaier, S.; Siegel, L.; Snell, K.; Sotiropoulos, J.; Vale, C.; White, I.; Williams, J.; Godolphin, P.

2026-02-04 epidemiology 10.64898/2026.02.03.26345481 medRxiv

Top 0.1%

12.2%

Show abstract

BackgroundIndividual participant data (IPD) meta-analyses obtain, harmonise and synthesise the raw individual-level data from multiple studies, and are increasingly important in an era of data sharing and personalised medicine to inform clinical practice and policy. Objectives(1) Describe the landscape of IPD meta-analysis of randomised trials over time; (2) establish current practice in design, conduct, analysis and reporting for pairwise IPD meta-analysis; and (3) derive recommendations to improve the conduct of and methods for future IPD meta-analyses. DesignPart 1: systematic review of all published IPD meta-analyses of randomised trials; Part 2: in-depth review of current methodological practice for pairwise IPD meta-analysis; and Part 3: adapted nominal group technique to derive consensus recommendations for IPD meta-analysis authors, educators and methodologists. Data sourcesMEDLINE, Embase, and the Cochrane Database of Systematic Reviews (via the Ovid interface). Eligibility criteriaPart 1: all IPD meta-analyses of randomised trials published before February 2024, evaluating intervention effects and based on a systematic search. Part 2: all pairwise IPD meta-analyses from part 1 published between February 2022 and February 2024. Part 3: Selected panel of experienced IPD meta-analysis authors and/or methodologists. ResultsPart 1: We identified 605 eligible IPD meta-analyses published between 1991 and 2024. The number of IPD meta-analyses published per year increased over time until 2019 but has since plateaued to about 60 per year. The most common clinical areas studied were cardiovascular disease (n=113, 19%) and cancer (n=110, 18%). The proportion of IPD meta-analyses published with Cochrane decreased over time from 16% (n=31/196) before 2015 to 3% (n=5/196) between 2021-2024. Part 2: 100 recent pairwise IPD meta-analyses were included in the in-depth review. Most cited PRISMA-IPD (68, 68%) and conducted risk of bias assessments (n=82, 82%), with just under half carrying out subgroup analyses not at risk of aggregation bias (n=36/85, 41%). However, only 33% (n=33) and 29% (n=29) respectively provided a protocol or statistical analysis plan, and only 7% (n=6/82) reported using IPD to inform risk of bias assessments. Part 3: 24 experts participated in a consensus workshop. Key recommendations for improved IPD meta-analyses focused on transparency (prospective registration; published protocols and statistical analysis plans) and maximising value (searching trial registries; obtaining IPD for unpublished evidence; using IPD to address missing data and risk of bias). Methodologists and educators should strengthen dissemination of methods and support capacity building across clinical fields and geographical areas. ConclusionsThe application and methodological quality of IPD meta-analyses of randomised trials has increased in the last decade, but shortcomings remain. Implementing our consensus-based recommendations will ensure future IPD meta-analyses generate better evidence for clinical decision making. Study registrationOpen Science Framework (1) Summary boxesO_ST_ABSWhat is already known on this topicC_ST_ABSO_LIIPD meta-analyses of randomised trials are regularly used to inform clinical policy and practice. C_LIO_LIThey can provide better quality data and enable more thorough and robust analyses than standard aggregate data meta-analyses, but are resource-intensive and can be challenging to conduct, leading to variable methodological quality C_LIO_LIPrevious studies that evaluated the conduct of IPD meta-analyses pre-date several major developments, such as the introduction of the PRISMA-IPD reporting guideline. C_LI What this study addsO_LIThis is the most comprehensive assessment of IPD meta-analyses of randomised trials to date (605 studies), showing an increase in publications over time followed by a recent plateau. C_LIO_LIThe conduct of IPD meta-analysis has improved in recent years including increased use of prospective registration, assessment of risk of bias, appropriate analyses of patient subgroup effects and citing the PRISMA-IPD statement. C_LIO_LIMany shortcomings remain including (i) insufficient pre-specification of methods such as outcomes and analyses, (ii) sub-standard transparency (including publication of protocols, statistical analysis plans and reporting of analyses), and (iii) failure to gain maximum value of IPD (i.e. include unpublished trials, use the IPD to inform risk of bias and trustworthiness assessments, and address missing data appropriately); expert consensus recommendations are provided for how to address these gaps. C_LI

10

Protocol for LLM-Generated CONSORT Report for Increased Reporting: A Parallel-Arm Randomized Controlled Trial (Protocol)

Krauska, A. N.; Rohe, K.

2026-04-17 health policy 10.64898/2026.04.15.26350926 medRxiv

Top 0.1%

10.0%

Show abstract

Background Randomized controlled trials (RCTs) often have incomplete methods reporting despite widespread adoption of the CONSORT guideline. The editorial process is supposed to detect these shortcomings and request clarifications from authors, which is time-consuming. We developed an LLM-based CONSORT Rohe Nordberg Report that highlights which CONSORT items appear fully or partially reported and checks page references claimed by authors, and then creates follow up questions for authors to more easily correct missing information. Methods This parallel-arm, superiority RCT will randomize eligible RCT submissions (after desk screening) 1:1 into intervention (editorial team and authors receive the Rohe Nordberg Report) or control (standard editorial review only). The primary outcome is whether manuscripts improve their reporting of CONSORT items in the Methods and Results sections between the original submission and first revision. This will be assessed by blinded human reviewers who evaluate the textual changes for improvements between the original and revised manuscripts for each relevant CONSORT item. Secondary outcomes include time to editorial decisions, rejection and non-resubmission rates, if authors can correctly identify where CONSORT items are reported, and extent of revisions. Human evaluators will be blinded to whether the manuscript was in the intervention or control group. Discussion By providing authors and the editorial team with specific follow up questions for each underreported CONSORT item, we hypothesize that basic underreporting will be more efficiently detected and corrected. Using blinded human reviewers as the primary outcome assessors ensures a rigorous, unbiased evaluation. If successful, this approach may help align manuscripts more closely with CONSORT standards, ultimately benefiting evidence synthesis.

11

Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350286 medRxiv

Top 0.1%

9.4%

Show abstract

Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.

12

Time-to-retraction and likelihood of evidence contamination (VITALITY Extension I): a retrospective cohort analysis

Yuan, Y.; Peng, Z.; Doi, S. A. R.; Furuya-Kanamori, L.; Cao, H.; Lin, L.; Chu, H.; Loke, Y.; Mol, B. W.; Golder, S.; Vohra, S.; Xu, C.

2026-02-24 epidemiology 10.64898/2026.02.20.26346631 medRxiv

Top 0.1%

8.8%

Show abstract

BackgroundThe number of problematic randomized clinical trials (RCTs) has risen sharply in recent decades, posing serious challenges to the integrity of the healthcare evidence ecosystem. ObjectiveTo investigate whether retraction of problematic RCTs could reduce evidence contamination. DesignRetrospective cohort study SettingA secondary analysis of the VITALITY Study database. Participants1,330 retracted RCTs with 847 systematic reviews. MeasurementsThe difference in the median number (and its interquartile, IQR) of contamination before and after retraction. The association between time-to-retraction and likelihood of evidence contamination. ResultsAmong these retracted RCTs, 426 led to evidence contamination, resulting in 1,106 contamination events (251 after retraction vs. 855 before retraction). The time interval between RCT publication and first contamination ranged from 0.2 to 30.9 years, with a median of 3.3 years (95% CI: 3.0 to 3.9). The median number of contaminated systematic reviews was lower after retraction than before retraction (0, IQR: 0 to 1 vs. 1, IQR: 1 to 2, P < 0.01). Compared with trials retracted more than 7.5 years after publication, those retracted between 1.0 and 1.8 years (OR = 0.70, 95% CI: 0.60 to 0.80) and retracted within 1.0 year (OR = 0.69, 95% CI: 0.60 to 0.80) were associated with lower likelihood of evidence contamination. LimitationsOnly assessed contaminated systematic reviews with quantitative synthesis and limited to retracted RCTs. ConclusionsRetracting problematic RCTs can significantly reduce evidence contamination, and faster retraction was associated with less contamination. To safeguard the integrity of the evidence ecosystem, academic journals should act promptly in the retraction of problematic studies to minimize their downstream impact. Primary Funding SourcesThe National Natural Science Foundation of China (72204003, 72574229)

13

Limiting to English language records: A comparison of five methods on Ovid MEDLINE and Embase versus removal during screening

Fulbright, H. A.; Morrison, K.

2026-03-20 health informatics 10.64898/2026.03.18.26348470 medRxiv

Top 0.1%

8.6%

Show abstract

Background: For evidence syntheses using English language limits, several different methods and approaches are available. Objective: To understand the English language (EL) limits available on Ovid MEDLINE and Embase and the application of language metadata on these databases. To compare the impact of five EL limits versus removing non-English language (NEL) records during screening. Methods: Using the records included at full text screening or excluded on NEL status during screening in seven evidence syntheses, we tested five EL limits on 1,509 MEDLINE and 1,584 Embase records. 'Includes' removed or 'NEL excludes' retrieved were investigated. Results: All EL limits performed identically, 99.8% of MEDLINE 'includes' were retrieved versus 99.7% on Embase. All five 'includes' incorrectly removed with EL limits had language metadata errors. Although 98.2% MEDLINE and 94.6% Embase 'NEL excludes' were removed with EL limits, eight MEDLINE and nine Embase records were available in English. Discussion: The risk of excluding potentially eligible records due to language restrictions (whether applied during the strategies or screening) could be mitigated with forward and backward citation searching. Conclusion: EL limits risk removing records with incorrect language metadata. However, EL records might also be excluded on language during screening.

14

Performance of Large Language Models in Automated Medical Literature Screening: A Systematic Review and Meta-analysis

Chenggong, X.; Weichang, K.; Liuting, P.; Diaoxin, Q.; Yuxuan, Y.; Bin, W.; Liang, H.

2026-03-19 epidemiology 10.64898/2026.03.17.26348656 medRxiv

Top 0.1%

8.6%

Show abstract

ObjectiveTo systematically evaluate the diagnostic performance of large language models (LLMs) in automated medical literature screening and to determine their potential role in supporting evidence synthesis workflows. MethodsA systematic review and meta-analysis was conducted according to PRISMA DTA guidance. PubMed, Web of Science, Embase, the Cochrane Library and Google Scholar were searched from 1 January 2022 to 17 November 2025. Studies assessing LLMs for automated title and abstract screening or full-text eligibility assessment in medical literature were included. Diagnostic accuracy metrics were extracted and pooled using a bivariate random effects model and hierarchical summary receiver operating characteristic (HSROC) analysis. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity. ResultsEighteen studies published between 2023 and 2025 were included. In title and abstract screening, the pooled sensitivity was 0.92 and pooled specificity was 0.94. The SROC area under the curve (AUC) reached 0.98. In full-text screening, pooled sensitivity and specificity both reached 0.99 and the AUC was 0.99. Prompt strategies incorporating examples or chain-of-thought reasoning significantly improved sensitivity. Across studies, most models were deployed without task specific fine tuning and still achieved strong performance. Subgroup analyses and meta regression did not identify significant sources of heterogeneity. Many studies also reported substantial efficiency gains, including large reductions in screening workload, time and cost. ConclusionLLMs demonstrate high diagnostic accuracy for automated medical literature screening, particularly in full-text assessment. These models show strong potential as high sensitivity assistive tools that can substantially reduce manual screening burden while supporting evidence synthesis. Further methodological optimization and validation in large scale real-world settings are required to establish their long term role in evidence-based medicine.

15

Systematic reviews in minutes to hours using artificial intelligence

Bakker, L.; Caganek, T.; Rooprai, A.; Hume, S.

2026-02-10 health informatics 10.64898/2026.02.06.26345764 medRxiv

Top 0.1%

8.6%

Show abstract

Systematic reviews are used in academia, biotechnology, pharmaceutical companies and government to synthesise and appraise large numbers of publications. The current (largely manual) workflow takes an average of 9-18 months1, at a cost of $100,000+ per review2. We built a platform, ScholaraAI, that leverages artificial intelligence to cut this to < 0.1% of the time, without compromising quality. ScholaraAI facilitates end-to-end systematic reviews; search, screening, data extraction, and analysis. The workflow is transparent, and the researcher is in the loop. Our approach is compliant with the PRISMA and RAISE frameworks. Compared to a benchmarking set of published systematic reviews, ScholaraAIs sensitivity for correctly included studies is 100% {+/-} 0%, its specificity for correctly excluded studies is 90.8 {+/-} 8.6%, and its accuracy for data extraction is 98.0 {+/-} 3.5%. The time taken per review was 3.67 hours {+/-} 1.26. We used ScholaraAI to produce a novel, up-to-date systematic review and meta-analysis, which is presented here. ScholaraAI is free to try at app.scholara.ai.

16

Removing animal and nonhuman records in Ovid Embase: A comparison of 11 filters

Fulbright, H. A.; Evans, C.

2026-03-17 health informatics 10.64898/2026.02.13.26346239 medRxiv

Top 0.1%

8.6%

Show abstract

IntroductionSeveral filters are routinely used to remove animal or nonhuman records in Ovid Embase, despite there being no performance data for them. The filters take different approaches in design. ObjectiveTo understand and compare the impact of 11 filters to remove animal or nonhuman records in Ovid Embase. To understand the indexing of relevant subject headings in Embase. MethodsTo assess filter performance, we screened and categorised 3,000 records as should be removed or should be retained and calculated the sensitivity, specificity and overall accuracy for each filter. We reported on the focus or content of records that were incorrectly removed, using seven categories. ResultsMethod 11 was the most sensitive, correctly retaining 90.6% records, whereas method 3 had the highest specificity, correctly removing 71.5% records. Out of seven categories, those in category 1 uses human participants or data were the most excluded. DiscussionFilters that did not remove nonhuman records had higher sensitivity. Filter performance could vary by subject, publication type and language due to differences in indexing. ConclusionIn choosing a search filter, information specialists and review teams should discuss whether animals or nonhumans could feature in relevant studies.

17

Set-up, validation, evaluation, and cost-benefit analysis of an AI-assisted assessment of responsible research practices in a sample of life science publications

Kniffert, S.; Kathoefer, B.; Emprechtinger, R.; Pellegrini, P.; Funk, E. M.; Dhamrait, I. S.; Zang, Y.; Bornmueller, A.; Toelch, U.

2026-02-02 scientific communication and education 10.64898/2026.01.23.701317 medRxiv

Top 0.1%

8.4%

Show abstract

The (semi-)automated screening of publications for diverse quality and transparency criteria is at the core of systematic literature assessment. Typically, the assessment process involves two initial reviewers and one additional reviewer for cases that require reconciliation. Here, we explore to what extent this process can be assisted by Large Language Models (LLMs). Specifically, whether LLMs are capable of assessing responsible research practices (RRPs) in scientific papers in a robust way. We employed proprietary LLMs to assess an initial set of 37 papers across ten RRPs. The same papers were also reviewed by three human reviewers. We iteratively redesigned prompts to increase model accuracy compared to human ratings which we treated as the gold standard. The resulting pipeline was validated on an additional set of 15 papers. We show that LLM accuracy is comparable to single human reviewer performance (90% for LLM vs 86% for a single human reviewer). However, performance strongly depended on the specific RRPs with accuracy ranging from 40% to 100%. LLMs exhibited an affirmative bias, making more errors when practices were not reported in the papers. Overall, we show how such an approach potentially replaces one human reviewer, enabling AI-assisted assessment of research papers. We discuss how dataset imbalances, validation procedures, and implementation time limit the broad applicability of such approaches. Through this, we develop initial guidance on the utility of proprietary LLMs in evidence synthesis.

18

Development and Validation of the Transcranial Magnetic Stimulation Reporting Assessment Tool(TMS-RAT)

Szekely, O.; Holmes, N. P.; Ashton, J.; Breuer, F.; Chen, H.-Y.; Di Chiaro, N. V.; Duport, A.; Frangou, P.; Gwynne, L.; Hassan, U.; Lowe, C. J.; Mathias, B.; Peng, N.; Pepper, J. L.; Phylactou, P.; Szymanska, M. A.; Tame, L.

2026-02-19 neuroscience 10.64898/2026.02.18.706460 medRxiv

Top 0.1%

7.3%

Show abstract

HighlightsO_LIWe introduce the TMS-RAT, a reporting (assessment) tool for TMS studies C_LIO_LIDeveloped within a community-informed, iterative process rating 333 TMS studies C_LIO_LIEmpirically evaluated for usability, inter-rater, and test-retest reliability C_LIO_LIA validated subset enables reliable retrospective assessment of reporting C_LIO_LIThe modular structure enables use across a wide range of TMS study designs C_LI BackgroundA standardised tool for comprehensive reporting can improve transparency, support consistent documentation, and enable comparison across transcranial magnetic stimulation (TMS) studies. The most used reporting checklist lacks clear definitions of full reporting and was not initially evaluated for usability or inter-rater reliability. A scoping review of studies using this checklist shows that its items are reported only 50% of the time, suggesting that method descriptions are often incomplete. MethodsWe developed the TMS Reporting Assessment Tool (TMS-RAT), a comprehensive reporting framework that provides clear definitions and examples for its items, covering a wide range of TMS protocols. We tested the usability and reliability of the TMS-RAT by rating all studies published between 1991 and 2025 using afferent conditioning (n = 333), a protocol encompassing many reporting categories identified during tool development. Seventeen independent raters contributed across three development phases, a validation phase, and a retest phase, with naive raters introduced in each phase. Iterative refinements of the tool were informed by inter-rater reliability, qualitative rater feedback, and consultation with external TMS experts. ResultsWe present two versions of the tool: the 72-item TMS-RAT v1.0, designed to guide comprehensive reporting, and the TMS-RAT v1.1, a subset of 50 items with the highest inter-rater (overall AC1 = 0.78, range = [0.60-0.99]) and test-retest reliability (overall AC1 = 0.82, range = [0.65-1.0]), intended for retrospective evaluation of reporting in systematic reviews, meta-analyses. ConclusionThe TMS-RAT is a comprehensive, reliable tool that seeks to improve transparency and reproducibility in TMS research.

19

From Protocol to Analysis Plan: Development and Validation of a Large Language Model Pipeline for Statistical Analysis Plan Generation using Artificial Intelligence (SAPAI)

Jafari, H.; Chu, P.; Lange, M.; Maher, F.; Glen, C.; Pearson, O. J.; Burges, C.; Martyn, M.; Cross, S.; Carter, B.; Emsley, R.; Forbes, G.

2026-03-19 health systems and quality improvement 10.64898/2026.03.19.26348626 medRxiv

Top 0.1%

6.2%

Show abstract

Background: Statistical Analysis Plans (SAPs) are essential for trial transparency and credibility but are resource-intensive to produce. While Large Language Models (LLMs) have shown promise in drafting protocols, their ability to generate high-quality, protocol-compliant SAPs remains untested against current content guidance. This study developed and validated an LLM-based pipeline for drafting SAPs from clinical trial protocols. Methods: We developed a structured, section-by-section prompting pipeline aligned with standard SAP guidance. We applied this pipeline to nine clinical trial protocols using three leading LLMs: OpenAI GPT-5, Anthropic Claude Sonnet 4, and Google Gemini 2.5 Pro. The resulting 27 SAPs were evaluated against a 46-item quality checklist derived from the published SAP guidelines. Items were double-scored by independent trial statisticians on a 0 to 3 scale for accuracy. We compared performance across LLMs and between item types (descriptive vs. statistical reasoning) using mixed-effects logistic regression. Results: Across 9 trials, the models produced SAP drafts with high overall accuracy (77% to 78%), with no difference in performance between the three LLMs (p=0.79) but varied by content type (p < 0.001). All models performed well on descriptive items (e.g., administrative details, trial design), with lower accuracy for items requiring statistical reasoning (e.g., modelling strategies, sensitivity analyses). Accuracy for statistical items ranged from 67% to 72%, whereas descriptive items achieved 81% to 83% accuracy. Qualitatively, models were prone to specific failure modes in complex sections, such as omitting necessary details for secondary outcome models or hallucinating sensitivity analyses. Discussion: Current LLMs can effectively draft portions of SAPs, offering the potential for substantial time savings in trial documentation. However, a human-in-the-loop approach remains mandatory; while models demonstrate strong capability in producing descriptive content, their independent application to complex statistical methodology design still requires further methodological development and training. Future work should explore advanced prompt engineering, such as retrieval-augmented generation or agentic workflows, to improve reasoning capabilities.

20

An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350296 medRxiv

Top 0.1%

4.9%

Show abstract

Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.